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Abstract 

Background: Biopsies taken from individual tumours exhibit extensive differences in their cellular composition due 
to the inherent heterogeneity of cancers and vagaries of sample collection. As a result genes expressed in specific 
cell types, or associated with certain biological processes are detected at widely variable levels across samples in 
transcriptomic analyses. This heterogeneity also means that the level of expression of genes expressed specifically 
in a given cell type or process, will vary in line with the number of those cells within samples or activity of the 
pathway, and will therefore be correlated in their expression. 

Results: Using a novel 3D network-based approach we have analysed six large human cancer microarray datasets 
derived from more than 1,000 individuals. Based upon this analysis, and without needing to isolate the individual 
cells, we have defined a broad spectrum of cell-type and pathway-specific gene signatures present in cancer 
expression data which were also found to be largely conserved in a number of independent datasets. 

Conclusions: The conserved signature of the tumour-associated macrophage is shown to be largely-independent 
of tumour cell type. All stromal cell signatures have some degree of correlation with each other, since they must all 
be inversely correlated with the tumour component. However, viewed in the context of established tumours, the 
interactions between stromal components appear to be multifactorial given the level of one component e.g. 
vasculature, does not correlate tightly with another, such as the macrophage. 
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Background profiles and others to stratify patients into the most 

In recent years the field of cancer research has seen an appropriate treatment group. The latter whilst not yet 
increasing number of large gene expression studies of driving therapeutic options, clearly has potential implica- 
primary human tumours. Analysis of these datasets has tions for individualised patient therapy [4,7]. These studies 
tended to focus on the identification of markers able to have generally focused on the identification of groups of 
divide disease samples into prognostically relevant classi- differentially expressed genes that can be used to divide 
fications [1-5]. In the seminal paper by Alizadeh et al tumours into subgroups using statistical approaches. 
[6] they were able to subdivide lymphomas on the basis Such predictive gene signatures are frequently composed 
of their gene expression profiles and thereby associate of genes with no obvious shared biological function, 
specific genes with the tumour s clinical characteristics. Indeed, there may be a number of signatures derived for 
Subsequently, numerous studies have attempted to classify essentially the same purpose that share few if any genes in 
other tumour types based on their gene expression common [8]. 

An alternative approach is to generate signatures that 
reflect a specific biological process or outcome [9-11] or 
sets of coexpressed genes based upon correlation matrices 
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gene expression data is the heterogeneity of samples. The 
tumour cells themselves differ not only in the nature of 
the mutation(s) that have driven them, but also the geno- 
type of the patient and the treatment that they have 
received. A significant proportion of a tumour mass is 
comprised of stromal cells [15]; these non-transformed 
cells forming the microenvironment in which tumour cell 
growth is contained and supported. Indeed, the tumour 
stroma is increasingly seen as an alternative target for 
therapeutics with potential treatments targeting angiogen- 
esis [16,17], the extracellular matrix [18] or immune cells 
[16,19]. 

One approach to analysis of the cancer versus the 
stromal components in a tumour is to employ laser cap- 
ture microdissection e.g. [20]. Here we present an in 
silico approach to dissecting the expression profiles 
of individual cell types in the tumour stroma, as well as 
the cancer cell component. We have developed a com- 
putational framework and associated tool that now 
supports visualization and clustering of very large correl- 
ation networks derived from microarray expression data 
[21,22]. The approach takes advantage of the heterogen- 
eity of tumour samples. The underlying premise of this 
approach is that the expression of genes specifically as- 
sociated with a given cell type or pathway will increase 
or decrease with the relative abundance or activity of 
those cells/pathways within a given sample, either 



because of genuine biological variation or random sam- 
pling of different regions of the tumours (Figure 1). The 
relative significance of correlation increases with the size 
of the dataset, as the probability of coexpression occur- 
ring by chance decreases. Modelling these associations 
as a graph brings together groups of functionally associ- 
ated genes which share similar expression profiles such 
that they form cliques of high connectivity in a graph. 
We have recently confirmed this hypothesis through the 
meta-analysis of large collections of expression data de- 
rived from many different populations of mouse cells 
[23,24], pig tissues [25] and clinically derived samples 
[26]. Here we demonstrate using individual cancer 
datasets that global expression patterns can be divided 
into biologically meaningful clusters defining tumour 
cell and stromal elements, and also that many of these 
gene signatures are conserved across multiple unrelated 
human cancer datasets. 

Results and discussion 

Network topology 

Following download of all cancer data it was subjected 
to rigorous QC as poor quality data can have a signifi- 
cant and detrimental effect on correlation network top- 
ology. Only data that passed QC was used to construct 
the large networks described here. Network topology 
was visibly complex (Figure 2) and unsupervised cluster 
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Figure 1 Rationale behind the study. The relative number of a specific cell type or activity of certain pathways will vary across a 
collection of individual tumours. For example, the macrophage content (0) will differ in every tumour and so therefore will the mRNA level of 
macrophage specific genes (in blue). Similarly in every tumour at the point it is sampled the number of cells in mitosis (the mitotic index) will 
differ and this will be reflected in different levels of expression of cell cycle genes (in red). As a result the expression level of genes specifically 
expressed by those cells or associated specifically with the pathways will vary accordingly. By calculating the correlation coefficient between 
every gene on the array and every other gene on the array it is possible to calculate a correlation matrix that includes all these correlation 
coefficients. Graphs are then used to visualise relationships above a given correlation threshold and clustering used identifying groups of 
co-expressed genes. 
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Figure 2 Network graphs derived from six cancer datasets; a) breast carcinoma, b) colorectal carcinoma, c) DLBCL, d) glioma, 

e) ovarian carcinoma and f) testicular germ cell tumours. Each dataset studied had its own idiosyncrasies owing to the tumour-specific biology 

each represents and the high degree of inherent variation in gene expression data derived from cancer samples. In order to visualise and analyse 

a large proportion of the expressed genes in these samples we aimed to construct graphs of data derived from individual cancer types based on 

just under half the transcripts represented on the chip (18,000-23,000 probesets). For these reasons relatively low correlation thresholds were used 

for graph construction i.e. between r = 0.65-0.75. The resultant graphs of individual cancer datasets are highly structured and composed of a large 

number of nodes and edges (18,934 - 23,015 nodes, connected by between 268,471 - 954,082 edges, see Table 1 for details), 
k. J 



analysis using the Markov clustering (MCL) algorithm 
[27] defined cliques of highly connected nodes in all 
graphs. Each clique (cluster) represented transcripts 
whose expression patterns were highly correlated across 
the dataset. These were surrounded and linked by 
sparser network structures. GO and pathway enrichment 
analysis was able to demonstrate functional enrichment 
in many of the clusters, but was generally poor in identi- 
fying clusters associated with the specific cell types from 
which some of these signatures were clearly derived. We 
therefore supplemented this analysis by comparison with 
gene sets (clusters) derived from datasets of isolated tis- 
sues and purified cell populations [28,29] and mining of 
the literature. All graphs described in this work are avail- 
able from the website www.OncoGraph.org which sup- 
ports the direct visualization of graphs in BioLayout 
Express 3 ^ using Java web start technology. 

Technical replicates and functionally related genes are 
closely associated in the graph network 

Data derived from different probesets designed to the 
same gene, genes in the same loci and functionally re- 
lated genes were frequently found to be connected 
within the graphs. For example in the testicular dataset 
the six probesets designed to haemoglobin alpha (HBA1) 
and three designed to haemoglobin beta (HBB) clustered 
together reflecting the known co-regulation of these loci 



(Figure 3a). Likewise the multiple probesets for the 
growth hormone genes, GH1 and GH2, were closely as- 
sociated within the testicular network graph alongside 
the chorionic somatomammotropin hormones (CSH1, 
CSH2, CSHL1) which are all sited at the same loci on 
chromosome 17 (Figure 3b). Furthermore, two clusters 
of Hox genes were found to be co-expressed in certain 
testicular cancers (Figure 3c). Finally, markers of mono- 
cyte/macrophage populations CD 14, CSF1R and CD 163 
were all expressed at highly variable levels across indi- 
vidual tumours but exhibited very similar profiles across 
the dataset. Indeed, these markers were always closely 
associated within all graphs and were usually all present 
in a single MCL cluster. Furthermore, these clusters 
were also enriched with many other genes also known to 
be expressed in macrophages (for example gene list of a 
macrophage cluster derived from DLBCL, see Additional 
file 1). In a cluster such as this in which many of the 
genes can be associated with a given cell type, the cluster 
provides a unique insight into the functional profile of 
the cell within the tumour. Although many of the genes 
in such clusters are not recognised markers of these cells 
and are therefore only characterised as such by the 
principle of 'guilt by association', they may be of signifi- 
cant interest in terms of defining the functional activity 
of cells or as potential targets for manipulating cell 
function. 
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Figure 3 Clusters of transcripts (left) derived from the testicular cancer dataset and (right) associated expression profiles (average 
signal per gene or cluster). The individual tumours (represented along the y-axis) were grouped by mixed (green in upper bar) or pure 
histological subtype (red in upper bar) and then by components present (coloured blocks in lower bar) a) Haemoglobin cluster containing data 
derived from 5 probesets designed to HBA1 and 3 designed to HBB. The haemoglobin locus cluster is found to be present in many human 
expression datasets and is often unconnected to any other genes, b) Cluster of somatotrophin genes whose expression is normally tissue specific 
and limited to pituitary and placenta, shown here to be expressed predominately in tumours containing elements of choriocarcinoma, a tumour 
formed of malignant trophoblast cells, the normal equivalent of which are involved in placenta formation, c) HOX genes found to be expressed 
only in two teratomas with secondary carcinomatous transformation. Interestingly one of these tumours shows a high degree of up regulation of 
one group of primarily HOXB genes and the other a mix of HOXA/C/D genes. 



Individual tumour datasets form unique network 
structures related to the specific mix of cell populations 

In previous studies of mouse primary cells and tissues 
[21,23,24] we were able to identify clusters associated 
with specific cell populations and others that reflected 
particular cell functions. This was possible because cells 
vary in their relative activity of different aspects of cell 
biology e.g. growth and proliferation, metabolism, pro- 
tein synthesis and secretion, endocytosis, motility etc. 
Similarly, in this analysis many clusters showed clear 
functional enrichment of genes encoding proteins asso- 
ciated with a cell-specific pathway or cellular process 
(Figure 4). Some clusters were common to all datasets 
and some, such as neuronal signatures in the glioma 
dataset or tissue signatures in teratomas, were specific to 
individual tumours. Across all tumour types there were 
closely related clusters of genes associated with cell cycle 
progression, similar to cell cycle signatures observed 
previously in other datasets [30,31]. The profile of these 
genes reflected the known relative proliferative rate of 
the tumour with, for example, expression at a higher 
level in aggressive types of ovarian tumours compared to 
those of a low malignant potential. A prominent cluster 
found in all graphs was formed of genes associated with 
the extracellular matrix (ECM); an expression signature 
also evident in mouse data and analysed with respect to 
human connective tissue-related diseases [32]. Expres- 
sion of genes in this cluster was higher in tumours 
characterised by a fibrotic stroma, for example 



expression of these genes being elevated in primary me- 
diastinal lymphoma (PMBL) as compared to other 
subtypes of DLBCL. Other small clusters contained 
known functional markers of endothelial cells (PECAM1, 
EMCN, ESAM, VWF), smooth muscle cells (CNN1, 
MYH11, TPM2) and adipocytes (AQP7, CD36, FABP4, 
LPL). However, the predominant stromal expression 
signatures are clearly associated with specific cells or ac- 
tivities of the immune system. Clusters enriched for 
markers of the monocyte/macrophage (CD14, CD163, 
CSF1R), T cell (CD2, CDS, CD7, GZMA), and interferon 
response (IFU-3, MX1/2, OAS 1-3) were present in all 
tumour types studied here, and their composition was 
essentially tumour-type independent. There was no evi- 
dence of skewing of the macrophage profile towards any 
particular phenotype. B cell-specific markers and related 
genes were present in the immune-related networks of 
some cancers but not others. In all tumours there was a 
prominent signature representing a post-germinal centre 
B cell/plasma cell which was rich in immunoglobulin 
genes as well as markers of post-germinal centre B cells 
such as IRF4. A neutrophil signature was identified only 
in the colorectal cancer dataset, reflecting the presence 
histologically of neutrophils in this tumour type. 

In all of the graphs there were also many clusters of 
genes showing relatively uniform expression across sam- 
ples. Many of the genes that formed these clusters were 
poorly annotated limiting the possibility of assigning any 
functional linkage between them but are likely to play a 
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Figure 4 Network derived from testicular cancer dataset (Pearson correlation threshold r = 0.75). a) Network with only edges showing 
allowing visualization of the inherent complex topology of the graph and b) with nodes shown where nodes are coloured according to cluster 
membership, c) Graph with selected clusters shown and d) the average expression profile of genes within those clusters. (Cluster colour code is 
maintained across graphs in this figure). Cluster 68 is highly enriched with endothelial marker genes and cluster 23 contains many transcripts 
known to be associated with extracellular matrix. The last three selected clusters can be associated with different aspects of the immune infiltrate 
in these tumours. Cluster 2 contains many know markers of T-cell and B-cells, cluster 4 (also shown in Figure 2) is enriched with many known 
macrophage expressed genes and cluster 10 is highly enriched with many interferon response genes. 



role in uncharacterised cellular housekeeping functions. 
In mouse data also, cell lineage-specific or inducible ex- 
pression of genes are associated with more informative 
annotation, reflecting the priorities of studies on gene 
function [23]. 

Disease networks 

Recognised markers associated with specific cancer sub- 
types often did not fall within large clusters, presumably 
because they are not highly correlated with global bio- 
logical features of cancer cells or the associated stroma. 
For example in the breast cancer dataset, there was a 
small component of the graph representing genes whose 
expression is lower in basal-like and ERBB2-positive tu- 
mours (Additional file 2). This component included 
ESR1 (oestrogen receptor alpha), GATA3, FOXA1 and 
XBP1, and therefore appears to capture a significant pro- 
portion of the oestrogen signalling transcriptional net- 
work, including modulators and downstream targets. 
GATA3 is involved in luminal differentiation in normal 
breast tissue [33] and ESR1 and GATA3 have been 
demonstrated to reciprocally regulate each other [34]. 
GATA3 is an inducer of FOXA1, which in turn can in- 
duce XBP1, while ESR1 acts both upstream and along- 
side FOXA1 (reviewed in [35]). Similarly, in the DLBCL 
dataset, IRF4, one of the markers of the ABC-subtype 
[6] lies in a sparse network on the edge of the graph, 
whose nearest neighbours include FOXP1, PIM2 and 



CARD 11, all described to be up-regulated in ABC- 
subtype of DLBCL, with amplifications or mutation 
affecting FOXP1 [36] and CARD 11 [36] identified 
in 38% and 10% respectively of tumours studied 
(Additional file 3). In both cases it would appear that the 
graphs have accurately identified key disease modules as- 
sociated with ESR1 or IRF4 and other genes lying in the 
immediate neighbourhood merit further investigation. 
An additional disease module [37], was associated with 
SILV in the skin cancer data [38] . The immediate neigh- 
bours of SILV, a gene whose product pMell7 is used 
clinically in the diagnosis of melanoma [39], includes 
TYR (tyrosinase) the key enzyme in melanin biosyn- 
thesis, MLPH (melanophilin), which plays a role in 
melanosome transport, GPR143 expressed on the mela- 
nosome membrane and ML ANA (melan-a). Also present 
was MITF, a melanocytic transcription factor and tran- 
scriptional regulator of many of these genes (reviewed in 
[40]) and SNCA (alpha-synuclein). 

Networks constructed based on the mean Pearson 
correlation values across multiple datasets identify 
conserved transcriptional signatures 

Having examined gene expression signatures within 
individual datasets, we questioned whether these signa- 
tures were preserved across different tumour types. 
Using six datasets (breast, colonic, DLBCL, glioma, ovar- 
ian, testicular) a full correlation matrix was calculated 
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for each and the mean Pearson correlation coefficient 
across the datasets calculated between all combinations 
of probes. Layout of a graph derived from the mean 
Pearson values (r = >0.6) across the six different tumour 
types resulted in a smaller graph than those derived 
from individual tumours at this threshold (Figure 5). 

The topology of this graph broke down into three 
main components which draw together clusters enriched 
in genes associated with broad functional groupings. 
The dominant topological feature contained four of the 



five largest gene clusters. Many of the genes in these 
clusters are poorly characterised but relatively uniformly 
expressed across all samples in all datasets and were 
therefore designated the 'house-keeping' (HK) clusters. 
In general these clusters were poorly conserved in indi- 
vidual datasets although areas of the graphs enriched in 
housekeeping clusters were clearly identifiable. 

A second portion of the graph was highly enriched 
with genes encoding cell-cycle or cell-cycle related 
proteins. Cluster 6 contains multiple cyclins, kinesins, 
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Figure 5 Network graph of conserved transcription signatures in cancer, a) 3D graph layout with labelling of main features in the network's 
topology. A graph of 9,882 nodes and 184,563 edges was created at Pearson threshold r>0.6. Clustering of the graph using the MCL algorithm 
resulted in 639 clusters ranging in size from 1,008 nodes to 4 nodes. A number of large clusters were shown to be highly enriched in genes 
associated with expression in individual cell types and/or associated with specific cellular functions. See Additional files 3 and 4 for details 
b) Collapsed cluster diagram showing the conserved gene network as a simplified 2D-network where single nodes represent a gene cluster and 
are sized according to the number of transcripts within the cluster, edges represent connections between members of each cluster. Nodes 
representing the main clusters have been coloured according to functional groupings. Blue - clusters represent those associated with 
housekeeping functions; green - clusters of genes which are directly involved in cell cycle progression or whose expression is in way some linked 
to it; pink - genes associated with the immune component of the tumour and yellow - other stromal elements. Smaller clusters enriched with 
genes of known function are also shown. 
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members of the minichromosome-maintenance-complex, 
E2F transcription factor family members, DNA polymer- 
ases and topoisomerase. The Aurora kinases {AURKA, 
AURKB), BUB1 and both checkpoint proteins (CHEK1 
and 2) are also present along with many other genes 
that have previously been associated with the cell cycle 
(see Additional file 4 for the full list of genes/clusters 
and Additional file 5 for an enrichment analysis of 
these clusters). Associated with the cell cycle cluster 
are further smaller clusters e.g. clusters 10, 16 and 26 
enriched in mitochondrial, ribosomal and glycolysis- 
related genes. 

The third main area of the graph is clearly associated 
with different elements of the tumour stroma with a 
number of immune-related gene clusters in close prox- 
imity to each other and those representing other stromal 
components being somewhat more distant. The macro- 
phage cluster (cluster 7) from the combined cancer 
graph contains many genes considered to be specific to 
the myeloid lineage including CD68, CD 14 and CSF1R. 
There is also enrichment for lysosomal genes, multiple 
genes involved in chemotaxis, and multiple toll-like re- 
ceptors as well as scavenger receptors CD 163, MARCO, 
MSR1 which have previously been described by many 
groups as expressed in tumour-associated macrophages 
(see [41] for a review). Also, within the macrophage 
cluster are multiple components of the MHC class II 
antigen processing machinery. Interestingly, among the 
genes also expressed is CD86, the co-stimulatory mol- 
ecule, suggesting that these cells may be able to effi- 
ciently present antigen to T cells. The T cell cluster 
(cluster 8) contains pan-T cell markers (CD2, CD3, CD7) 
and elements of the T cell receptor signalling cascade 
(ZAP70, LCK, VAV, ITK). There are many chemokines, 
cytokines and their receptors in the cluster (CXCL9, 
CCL1 9, CCLS, LTB, CXCR3, CXCR6, CCR7, CCR2, 
CCR5, 1L2RB, 1L2RG, 1L17R, 1L10RA) including also 
interferon gamma (IFNG), the prototypical classical' 
macrophage activator. The T cell signature is suggestive 
of an active state with expression of cytotoxic molecules 
granzymes and perforin as well as markers of activation 
(CD69). Lying adjacent to the T cell and macrophage 
clusters is a cluster of genes many of which have been 
associated with an interferon response containing ele- 
ments of the proteasome and multiple interferon regula- 
tory factors and interferon inducible proteins. 

The largest non-immune-related element of the stro- 
mal signature is a cluster of genes associated with extra- 
cellular matrix which are almost certainly expressed 
specifically in fibroblasts/myofibroblasts. It contains 
structural proteins including many collagens as well as 
cadherins, laminins, fibrillin and integrins. The signature 
also contains modifiers of the extracellular matrix 
such as MMP2, LOXL1, ADAMTS12, ADAMTS2 and 



receptors for growth factors (PDGFRB) and shares a 
high degree of overlap with the ECM signature derived 
from mouse [23,32]. The vascular signature fragments 
into four small but closely aligned clusters, three of 
which appear to represent endothelium and the fourth 
associated with the basement membrane/extracellular 
matrix component. These clusters contain classical and 
well characterised markers of vascular differentiation 
such as PECAM1 (CD31), CD34, VWF, KDR and CDHS. 
In addition, they contain many genes that have been 
identified as endothelial specific genes by alternative bio- 
informatic analysis approaches (ECSCR, EMCN, ROB04, 
TEK, EPAS1, GPR116) [42-45], components of the 
Notch signalling pathway and other endothelial genes 
which have been demonstrated in normal and tumour 
associated endothelium such as PLVAP [46]. Finally 
there is a small cluster that contains many adipocyte 
specific genes including ADH1B, AD1POQ, FABP4 and 
LPL. Other small clusters or groupings of small clusters 
of note contain the Affymetrix control probes, histone 
complexes, API transcription factors/early response 
genes (/UN, JUNB, FOS, EGR1, EGR3, 1ER2, NR4A1, 
NR4A2, ATF3, CTGF and DUSP1) and as mentioned 
previously somatotrophins (GH1, CSH1, CSH2, CSHL2). 

Core signatures are conserved in an unrelated dataset 

In order to confirm that the core' transcriptional signa- 
tures generated from the meta-analysis of six datasets 
are conserved in other cancer datasets, we mapped the 
signatures onto a number of completely independent 
tumour datasets derived from skin/melanoma [38], gas- 
tric cancer [47] and Hodgkin lymphoma [48]. In each 
case clusters derived from the meta-analysis of the six 
tumours identified corresponding clusters in these inde- 
pendent datasets. Shown here are the results of their 
comparison to a dataset consisting of primary skin can- 
cers including basal cell carcinomas (BCC), squamous 
carcinomas (SCC) and melanomas, plus a number of 
metastatic melanomas [38]. Like the other independent 
datasets, this contained unique transcriptional signatures 
corresponding to the different tumour types represented 
in this dataset (Figure 6). However the core signatures 
were clearly also present. For example, cluster 16 (desig- 
nated 'macrophage') in the skin cancer dataset was 
highly significantly enriched for genes found in the 
macrophage cluster in the merged' dataset (cluster 7 in 
Figure 5) (adjusted p-value = 1.3 E " 120 ) implying that these 
genes represent a true 'functional unit', in this case a cell 
signature. Similarly cell cycle, stromal and house- 
keeping clusters were also conserved in the skin cancer 
data (Table 1) and all other cancer datasets so far exam- 
ined have all generated networks where the conserved 
signatures identified here have been found to be present. 
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Figure 6 Conservation of transcriptional signatures in graph derived from skin cancer dataset (Pearson correlation threshold r = 0.80). 

In order to provide a clear view of transcripts/clusters within skin cancer graph it has been simplified. The network shown here has been 
constructed with a central framework of edges derived from the relationships between clusters and nodes representing the transcripts in each 
cluster joined to a central node representing the cluster with the graph laid out in 2D. Only clusters comprising more than 8 probesets were 
included, a) Colours represent different clusters in the skin cancer data, and b) overlay of clusters from the merged cancer (r = 0.6) graph 
displayed using larger nodes. Many of the housekeeping clusters (1-5, 1 1) can be seen to be conserved, as is a proportion of the cell cycle (6), 
macrophage (7), T-cell (8), ECM (9), interferon response (12), plasma cell (14), MHC class 1 (19), histones (20) and Affymetrix control (23) clusters. 
However it can also be seen that many of the skin cancer clusters are not represented in the merged cancer profile set, these transcriptional 
signatures being unique to skin cancers. 



This study demonstrates the feasibility of an in silico 
alternative to laser capture microscopy to identify the 
gene expression profiles of the cells that make up a 
tumour. Understanding the microenvironment of the 
tumour allows exploration of potential new targets for 
therapy, directed not at the malignant cells, but at the 
environment in which they exist. Study of these cells is 
complicated by the heterogeneous background in which 
they exist and isolation of the cells from their back- 
ground, unless by approaches such as microdissection, 
will inevitably change them. Any clinically relevant 
approach to the microenvironment must of necessity, 
address the microenvironment of the established 
tumour, rather than the factors that contribute to 
tumour development. We have used BioLayout 
Express 3 ^ a visualization tool that allows exploration of 
complex networks of a size not previously possible [21]. 
Furthermore, we have used the MCL algorithm [27] to 
group nodes (genes) into clusters in a completely un- 
supervised manner. In this respect MCL has been shown 
to perform as well or better than other network cluster- 
ing algorithms [49]. Previous efforts to identify gene 
signatures/modules in cancer data have used different 
analytical approaches, less data, grouped far fewer genes 
and often failed to explain the biological significance of 
their findings [9-11,50,51]. Where correlation networks 
have been used previously to analyse modularity in gene 
expression data [13,52], available computing frameworks 
have not permitted the visualization or exploration of 
the resultant graphs as effectively in the current study. 



The robustness across different datasets, and the obvious 
association of genes of known function or cell lineage- 
restriction, provides a strong internal validation for our 
approach. 

The preservation of specific clusters associated with 
stromal cells across such a large number of genetically 
diverse individuals and multiple tumour types argues 
there is a common tumour microenvironment that con- 
trols, and is controlled by, interactions amongst ele- 
ments of the stroma. There is already a wealth of data 
on the role of the tumour associated macrophage 
(TAM), with the majority of studies suggesting that large 
numbers of TAMs are associated with poor prognosis 
(reviewed in [53,54]). Macrophages have been attributed 
functions in assisting invasion, promoting angiogenesis 
and subverting an immune response to the advantage of 
the tumour. To date there are only global gene expres- 
sion profiles from TAMs derived from inbred mouse 
tumour models in which the cells have been separated 
from their microenvironment and therefore potentially 
had their gene expression altered by the process of 
isolation. Our core macrophage signature contains genes 
involved in phagocytosis, MHC class II antigen presenta- 
tion and T cell co-stimulation and is therefore suggestive 
of a macrophage acting as an antigen-presenting cell. 
The TAM profile also contains scavenger receptors and 
genes involved in lipid metabolism suggesting a role in 
apoptotic cell clearance by TAMs. Analysis of the T cell 
profile demonstrates the presence of an almost com- 
pletely intact antigen recognition and signalling pathway 



Table 1 Summary of the gene coexpression clusters conserved across all datasets studied here 



Descriptive Cluster 
class description 



Immune Macrophage 



Tcell 



Cluster ID No. of No. of genes Known markers 

number probesets in cluster present in cluster 

in cluster 



Gene ontology 
annotation 



220 



181 



163 CD68, CD14, CD163, CSF1R, Fc 
Receptors (CD16, CD32, CD64), 
MHC II molecules 



1 45 CD2, CD3, CD6, CD7, CD52, 
TCR 



(p-value for 
EASE score) 

Immune system process 
(3.96 e " 32) 

Defence response 
(8.79 e ~ 22 ) 

Immune system process 
(4.99 e ~ 35 ) 

Signal transduction 

(7.32 e " 18 ) 

T cell activation (2.58 e ~ 



Other annotation 

* KEGG pathway, ** 
Curated gene set *** 
Swiss-Prot 

Keywords (p-value) 



TCR signalling pathway 
(6.1 e " 25 )* 



Conservation of Conservation of 

signature in skin signature in skin 

dataset: Cluster dataset: significance 

number of enrichment 

(Adjusted 
Fisher's test) 

~^T~ 1.3 E " 120 



3.29 b 



Macrophage^ 
cell interface 

IFN response 



MHC class I 
lg/ plasma cell 
B-cell 



Mast cell 
AP1 response 



Stroma Extracellular matrix 



Adipocyte 



13 
12 

19 
14 



93 



87 
115 

35 
85 



79,142 10,7 



9 

13, 9 



163 



27 22 



58 
73 

16 

36 
6, 6 



4 
11,6 



100 



15 



GBP1. IFI27, IFR1, IRF2, OAS1, 
SP100, STAT1, 



HLA-A, HLA-B, HLA-C, HLA-E, 
B2M 

lg light and heavy chains, 
CD19, CD20, CD79 



Tryptase, Fc receptor for IgE 
FOS, JUNB, 



BGN, CALD1, FN1,collagens 



ADIPOQ, LPL 



Immune response 
(7.20 e "° 6 ) 

Immune response 
(4.32 e " 26 ) 

Response to virus 
(1.1 5 e " 21 ) 

Antigen processing and 
presentation (5.51 6-1 7 ) 

Immune response 

(7.49 6 " 10 ) 

B-cell receptor complex 
(2.3 1 e " 06 ) 

Immune response 
(1.84 e " 08 ) 

Proteolysis (0.008) 

Sequence specific DNA 
binding (2.1 9 e " 06 ) 

Regulation of cellular 
process (1.2 e " 04 ) 

Extracellular matrix 

(1.93 e " 40 ) 

Cell adhesion (1.16 6 " 17 ) 

Response to wounding 
(0.004) 



34 



Genes upregulated by IFNB 51 
in HT1080 (1.48 e ~ 47 )** 

Genes upregulated by IFNA 
in HT1080 (1.05 e ~ 43 )** 



MHC 1 (9.29 e 



lg C region (1.12 6 " 10 ) 



Zymogen (4.77 6 " 04 )* 
DNA binding (2.75 e ~ 



PPAR signalling (1.46 e "° 7 )* 

Adipocyte vs Fibroblast 
upregulated (1.49 e ~ 15 )** 



82 



21 



BCR signalling (1.28 e "°T 6 



167 



2.6"' 
4.06 E " 

1.2 E " 5 
2.48 E " 
6.52 E " 

2.72 E " 



5.5" 



Table 1 Summary of the gene coexpression clusters conserved across all datasets studied here (Continued) 



Endothelium 29,38,49 20, 17, 13 17, 13, 11 CD31, CD34, Endomucin, 

Endoglin, vWF 



Endothelium/ECM 59 11 6 COL4A1, COL4A2 

Smooth muscle 88, 249 9,6 5, 4 Alpha SMA, calponin 



Skeletal muscle 



46 



15 



15 Myoglobin, CKm, Myosin 



Cell adhesion (1.3 e "° 8 ) 

Blood vessel 
development (1.28 6 " 10 ) 



Cell adhesion (4.1 3 e " 05 ) 

Smooth muscle 
contraction (4.75 e ~ 05 ) 

Contractile fibre part 



Upregulated in glomerul in 58 
DM vs normal (8.33 e ~ 14 )** 

Brentani_Angiogenesis 

(6.87 e " 08 )** 



Muscle protein (3.95 e 



215 
907 



23 



6.75 b 



1 .22"" 
1 .27 E ~ 

2.55 E " 



Muscle development 



Cell cycle Cell cycle 



239 



182 AURKA, BUB1, CHEK2, CDC2, 
MCM2 



Cell cycle related 1 0, 1 6, 26 1 47, 52, 1 25, 44, 1 9 

23 



Ribosomal Ribosomal 



Other Histones 

functional 

classes 



Glycolysis 



Haemoglobins 

Affymetrix Affymetrix 
controls controls 



54,60,64, 12,11,11, 12,8,6,5 RPL38, RPS10, RPS19 
97 8 



20 



47 



30 



13 



26 



91 9 
23, 28 26, 22 



2 

26, 22 



HIST1H1C, HIST1H2AB, 
HIST1H3H 



GAPDH, GPI 
HBA1, HBB 



Serum fibroblast cell cycle 101 
(1.17e-117)** 



Cell cycle (3.06 6 " 59 ) 

DNA replication 

(1.48 6 " 39 ) 



RNA binding (2.29 e ~ 11 ) RNA binding (2.29 e 



1.3T 



Cytosolic ribosome 
(7.91 e " 13 ) 



Nucleosome (9.75 e ~ 23 ) 

Chromatin assembly 
(9.24 6 " 21 ) 

Glycolysis (1.46 e ~ 05 ) 



(2.61 e 



Gluconeogenesis 

(9.37 e ~ 07 )*** 



54 



Ribosomal protein 

(8.44 6 " 11 )*** 

Protein biosynthesis 

(1.62 e ~ 06 )*** 



Nucleosome core M 

1 e-21\^^ 



151 



144 

99 



4.25 t_ 
9.1 5 E_ 

6 E-30 

6 E " 31 
4.57 E_ 



Each cluster or group of related clusters has been placed into a functional grouping based on the biology from which it is derived. Details of the cluster(s) are provided together with selected pathway/Gene ontology 
enrichment scores for the genes that make up the clusters. (For a complete list see Additional file 4). 
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containing elements of the TCR, co-receptors and down- 
stream signalling molecules. Also in the signature are 
cytotoxic molecules and markers of activation suggesting 
these are, at least in part, activated cytotoxic T cells. The 
preservation across all tumour types of a cluster of genes 
associated with an interferon response and the presence 
of IFNG in the T cell signature argues that activation of 
this pathway forms a consistent part of the response to a 
tumour. This is also in keeping with data derived from 
murine models in which it was shown that TAMs 
express many interferon inducible genes [55]. Taken 
together, these data do not support the view that TAMs 
have a so-called M2 (or alternative activation) [19] 
phenotype characterised by dominant actions of 
interleukin 4. Nor does the analysis support the view 
that recognition of tumour-associated antigens is 
compromised by a lack of antigen-presenting cells. 

One of the larger signatures observed is associated 
with the extracellular matrix. This was enriched in struc- 
tural proteins, proteoglycans, modifiers of the extracellu- 
lar matrix and signalling molecules. Histologically, the 
presence of a desmoplastic tumour stroma is a well 
recognised phenomenon occurring in many tumour 
types. However, like many other elements of the micro- 
environment the precise role played by this reactive 
stroma has been difficult to assess: is the role of the 
stroma to contain the tumour or is it yet another factor 
recruited to promote the survival of the malignant cells? 
Recent data from studies of DLBCL suggest that in this 
tumour at least the answer may be that different ele- 
ments of the stroma contribute to both a good and a 
poor prognosis [56], whereas work in small cell carcin- 
oma has established the role of interactions between the 
ECM and tumour cells in resisting chemotherapy-induced 
death [57] . More recently work in a lung carcinoma model 
[58] highlighted the role that ECM components, in this 
case versican, can play in activating other elements of the 
microenvironment suggesting that as for other elements 
of the tumour microenvironment, cross-talk between ele- 
ments is likely to be of great importance. 

The vasculature signature observed here contains 
many well characterised markers of endothelial cells as 
well as less well characterised endothelial genes. It con- 
tains receptors and co-receptors (KDR, NRP2) for VEGF, 
the major angiogenic factor but also contains elements 
associated with Notch signalling, another important sys- 
tem in angiogenesis (for a review see [59]). NOTCH3, 
usually expressed in vascular smooth muscle, lies in the 
endothelial-related cluster enriched in ECM and base- 
ment membrane proteins. A recent study investigated 
the crosstalk between endothelial and mural cells via 
NOTCH3 signalling and showed a reduction in angio- 
genesis in an in vitro co-culture system when NOTCH3 
is knocked out in mural cells [60]. 



The fact that macrophage, T cell, ECM and 
endothelial-specific genes form independent clusters, in- 
dicates that there is not a tight causal relationship be- 
tween them. The T cell signature and macrophage 
signatures are to some extent correlated in all of the net- 
works, but this does not necessarily imply an interaction 
beyond the fact that the most fibrotic regions of a 
tumour tend to exclude leukocytes so the two cell types 
could be co-enriched by chance. Despite the reported as- 
sociation of macrophage number with microvessel dens- 
ity in some solid tumours [61-64], the signatures of 
macrophages and endothelial cells are clearly separate, 
so there is not likely to a strict macrophage requirement 
for angiogenesis, and in fact the drive to angiogenesis is 
likely to be multifactorial. This viewpoint is supported 
by the fact that the macrophage cluster does not contain 
any of the known regulators of endothelial proliferation, 
such as the vascular endothelial growth factors (VEGFs). 
Neither macrophage nor endothelial cluster contains 
TIE2, which has been implicated, based largely upon 
in vitro studies, in tumour-associated angiogenesis and 
regulatory T cell production [65,66]. Indeed, the T cell 
cluster does not contain FOXP3 or CD25, indicating that 
regulatory T cell activation is not a ubiquitous feature of 
the immune environment of tumours. 

Conclusion 

In summary, we have demonstrated a unique approach 
to phenotyping cell types and identifying pathways 
within cancer without the need for technologies such as 
microdissection. The approach is related to the views of 
pathways, interaction and functional relationship that 
can be derived from analysis of introduced genetic vari- 
ation in yeast [67]. The core signatures we report pro- 
vide a tool to aid the analysis of further datasets, and 
using the tool BioLayout Express 3 ^ they can readily be 
overlaid on to other data as an aid to the interpretation 
of other large-scale expression data. In considering 
therapeutic approaches to cancer, our approach identi- 
fies sets of genes that are common to a range of tumour 
types, and to the stromal components, and which might 
therefore be potential targets. It also identifies candidate 
markers for assessing the mechanism and efficacy of 
therapeutic intervention. 

Methods 

Selection and preprocessing of datasets 

Datasets were selected on the following criteria; analysis 
of primary human tumour samples, large study size, 
availability of raw data with provision of clinical annota- 
tion and genome-wide analysis using the Affymetrix 
U133 platforms (either U133A + B or U133Plus2.0). 
These datasets were identified from Gene Expression 



Doig et al. BMC Genomics 2013, 14:469 
http://www.biomedcentral.com/1471 -21 64/1 4/469 



Page 12 of 16 



Omnibus (GEO) or caArray and CEL files downloaded. 
See Table 2 for details. 

The initial analysis used six individual datasets. The 
breast cancer dataset [4] was stratified on the basis of 
molecular tumour type (basal, HER2 positive, luminal A 
or B, or normal-like), Ellis-Elston grade, and outcome. 
The colorectal dataset [68] was divided into microsatel- 
lite stable and unstable tumours. The lymphoma dataset 
[69] was stratified into germinal centre B cell-like 
(GCB), activated B cell-like (ABC), primary mediastinal 
B cell (PMBL) and unclassified, based on gene expres- 
sion and further organised on the basis of sex and 
patient outcome. The glioma dataset, derived from 
caArray, was stratified on the basis of histological type 
(astrocytoma, oligodendroglioma, glioblastoma), WHO 
grade and sex. The ovarian dataset [70] was stratified by 
malignant or low malignant potential, histological type 
(endometrioid or serous), grade, stage and primary site 
(ovary, peritoneum or fallopian tube). The testicular 
dataset [71] contained primary germ cell tumours and 
was stratified by pure or mixed histological types and 
then within each category on the constituent elements 
using the WHO classification (seminoma, teratoma, 
embyronal carcinoma, yolk sac tumour, choriocarcin- 
oma). As such these data were selected to represent a 
broad range of tumour biology. Other data derived from 
various types of skin cancer including BCC, SCC, pri- 
mary melanoma, and melanoma metastatic to subcuta- 
neous tissue, lymph node, brain and adrenal gland [38], 
gastric cancer [47] and Hodgkins lymphoma [48] were 
used as test datasets to verify the conservation of gene 
signatures in independent datasets. 

A summary of the data analysis pipeline is shown in 
Figure 7a. The quality of the raw data from each dataset 
was reanalysed using the arrayQualityMetrics package 
in Bioconductor (http://www.bioconductor.org/) and 
scored on the basis of 5 metrics, namely maplot, spatial, 



boxplot, heatmap and rle [72]. Any array failing on more 
than one metric was removed (Table 2) and in cases 
comprising A and B arrays, failure of one chip resulted 
in removal of data for that patient from analysis. Where 
the data was derived from A and B arrays, a single 
merged file was created for each sample. Following QC, 
each dataset was normalised independently using the ro- 
bust multi-array average (RMA) expression measure 
[73]. Probesets were annotated using Bioconductor (26 
June 2009) and samples ordered according to clinical 
grouping. 

Network analysis 

Each dataset was saved as an '.expression' file containing 
a unique identifier for each row of data (Gene symbol 
concatenated to probeset ID), followed by columns of 
gene annotations used as class-sets for the overlay and 
analysis of information with respect to the graph, and 
finally natural scale normalised data values for each sam- 
ple (each column of data being derived from a different 
sample). These files were then loaded into the network 
analysis tool BioLayout Express 3 ® [21]. Pairwise Pearson 
correlations were calculated for each probeset on the 
array(s) as a measure of similarity between the signal de- 
rived from different probesets. All Pearson correlations 
where r > 0.6 were saved to a '.pearson file. Graph layout 
was performed using a modified Fruchterman-Rheingold 
algorithm [74] in 3-dimensional space in which 
nodes representing genes/transcripts are connected by 
weighted, undirected edges representing correlations 
above the selected threshold. Depending on the size of 
the dataset and the inherent variation of samples, 
datasets produced graphs of varying sizes at a given cor- 
relation threshold value (Figure 7b). Selected correlation 
thresholds for individual datasets were designed to in- 
clude approximately 40% of the available data (Table 2). 
The resultant graphs were very large and highly 



Table 2 List of the cancer datasets used for this study 



Database reference 


Reference (PMID) 


Tumor type(s) 


Cases 
analysed 


Affymetrix U133 
Platform (s) 


Graph size 
(nodes) 


Graph size 
(edges) 


GSE11318 


Lenz et al. (18765795) 


DLBCL 


194 


Plus 2.0 


19,850 


614,273 


GSE1456 


Pawitan et al. (16280042) 


Breast carcinoma 


134 


A & B 


19,246 


559,761 


GSE9891 


Tothill et al. (18698038) 


Ovarian (epithelial) carcinoma 


265 


Plus 2.0 


19,415 


268,471 


GSE3218 


Korkola et al. (16424014) 


Testicular germ cell tumours 


86 


A & B 


18,934 


954,082 


GSE13294 


Jorissen et al. (19088021) 


Colorectal carcinoma 


150 


Plus 2.0 


22,687 


725,467 


caArray/rembr-00037 


REMBRANT - Repository for 
Molecular Brain Neoplasia Data 


Primary CNS tumours 


253 


Plus 2.0 


23,015 


623,591 


GSE7553 


Riker et al. (18442402) 


Skin tumours 


77 


A & B 


19,623 


600,143 


GSE17920 


Steidl et al. (20220182) 


Hodgkin lymphoma 


131 


Plus 2.0 


13,846 


521,593 


GSE15459 


Ooi et al. (19798449) 


Gastric cancer 


200 


Plus 2.0 


15/747 


719,884 



The datasets with a white background are the six used for the primary analysis and the remaining three (bold) the datasets used to confirm the robustness of the 
core cancer expression signatures. 
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a 



Cancer datasets downloaded 



Data QC, normalisation and probe annotation 



4 



(A and B chips merged) 



Pairwise Pearson correlations 
calculated and stored 



I 



Network construction of 
graphs derived from individual 
datasets using correlations 
above selected threshold 
followed by MCL clustering 



I 



Analysis and annotation of 
graph structure using external 
resources 



44K probes common 
to all arrays selected 



Pairwise Pearson correlations 
calculated and stored for 
each dataset 



Mean pairwise Pearson 
correlations calculated 



Network construction and 
MCL clustering of graph of 
mean correlations >0.6 




structured (Figure 3). The topology of the graph 
contained localised areas of high connectivity and high 
correlation (representing clusters of genes with similar 
profiles), were determined using the Markov Cluster 
(MCL) algorithm which simulates multiple iterations of 
random flow through the graph structure. An MCL in- 
flation value of 2.2 was used as the basis of determining 
the granularity of clustering, as it has been shown to be 
optimal when working with highly structured expression 
graphs [21]. Clusters were named according to their 
relative size, the largest cluster being named cluster 1. 
Graphs of each dataset were explored extensively in 
order to understand the significance of the gene clusters 
and their relevance to the pathology of the tumours. 

Cluster annotation 

Gene set enrichment analysis was performed on clusters 
using DAVID [30,75] and GSEA MSigDB [31] web- 
based analysis tools to determine the significance of co- 



expressed genes. Clusters were annotated if hits of high 
significance showed a common trend as to function. 
These analyses were supplemented by comparison of the 
clusters with tissue- and cell-specific clusters derived 
from network-based analyses of a human tissue atlas 
and an atlas of purified leukocyte populations [28,29] 
and comprehensive reviews of the literature. 

Comparison of expression patterns across Six cancers and 
validation of signatures 

To allow direct comparison of the datasets, probesets 
that were not represented in all datasets i.e. were only 
present on the U133Plus 2.0 array, were removed leaving 
44,754 probesets common to all U133 platforms. The 
Pearson correlation coefficient between data derived 
from each probe-set in individual datasets was calculated 
using BioLayout Express 3 ^ and all correlation values 
written to file. A mean Pearson correlation was then cal- 
culated for each transcript i.e. the average correlation 
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between probesets across the six datasets. Mean Pearson 
correlations were filtered to remove any values below a 
user defined threshold and network graphs constructed. 
The work described here is based on the analysis of a 
graph constructed using a mean Pearson threshold of 
r >0.6 and clustered with an MCL inflation value of 2.2. 

In order to analyse the conservation of the gene signa- 
tures in an unrelated dataset, the clusters and associated 
annotations derived from the merged r >0.6 graph were 
mapped onto to datasets derived from various types of 
skin cancer [38], gastric cancer [47] and Hodgkins 
lymphoma [48]. Graphs were constructed from these 
data and clustered, and these clusters were analysed for 
enrichment of genes from the core signatures' using 
Fisher s test with an adjustment for multiple of testing. 

Additional files 



Additional file 1: Table of macrophage gene cluster derived from 
DLBCL dataset when examined using a Pearson correlation cut off 
of >0.65 and clustered using an MCL inflation value of 2.2. 

Additional file 2: Coexpression clustering of genes associated with 
ESR1 in breast cancer dataset. As expected the expression of ESR1 
(below) shows a marked reduction in ER-negative tumours. Examination 
of the neighbours of ESR1 in the correlation network pulls out many of 
the known E5R1 targets including F0XA1, XBP1, ERBB4 and GATA3 
together with some new and interesting candidate genes. 

Additional file 3: Coexpression clustering of genes associated with 
IRF4 in DLBCL dataset. IRF4, one of the markers of the ABC-subtype [6] 
lies in a sparse network on the edge of the graph. Its nearest neighbours 
include F0XP1, PIM2 and CARD1 1, all described to be up-regulated in 
ABC-subtype of DLBCL, with amplifications or mutation affecting F0XP1 
and CARD11 identified in 38% and 10% respectively of tumours studied. 

Additional file 4: Cluster analysis of merged cancer (r = 0.6) graph. 

This table lists the gene membership and annotation of each cluster 
from the meta-analysis of gene expression data derived from multiple 
tumour types. 

Additional file 5: GO analysis of the main clusters of interest from 
the combined cancer data analysis. All graphs and tables described in 
this work are available from the website www.OncoGraph.org which 
supports the direct visualization of graphs in BioLayout Express 30 using 
Java web start technology. 
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